Wine Quality analysis is a statistical study of the features affecting the quality of wine. The Wine Quality data set contains 11 features that influence quality, which is measured on a scale of 0-10. This work envisions studying the specific features that play a key role in determining wine quality using two different statistical tools: multiple regression and logistic regression. Box plots of the features and their corresponding effects on Quality are shown for a better and clearer understanding. Owing to limitations of multiple regression, such as collinearity, we have adopted logistic regression. Finally, as an out-of-the-box step, we have tested the predictability of this model to check whether the model developed from this data set can be used on other data sets.
FA VA CA RS CL FS TS D PH S A Q Y
1 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5 0
2 7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20 0.68 9.8 5 0
3 7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26 0.65 9.8 5 0
4 11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16 0.58 9.8 6 1
5 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5 0
6 7.4 0.66 0.00 1.8 0.075 13 40 0.9978 3.51 0.56 9.4 5 0
The abbreviations used have the following meanings:
1- FA - Fixed Acidity
2- VA - Volatile Acidity
3- CA - Citric Acid
4- RS - Residual Sugar
5- CL - Chlorides
6- FS - Free Sulfur Dioxide
7- TS - Total Sulfur Dioxide
8- D - Density
9- PH - pH
10- S - Sulphates
11- A - Alcohol
12- Q - Quality
13- Y - Binary quality indicator derived from Q (the response variable used in the logistic regression)
Wine, once an expensive good, is now increasingly enjoyed by a wide variety of consumers. In fact, Portugal is among the top ten wine-exporting countries, with about 3% of the market share in 2005 [5], and its wine exports increased by about 36% by 2007. Therefore, new technologies have been adopted to enhance the making and selling of wine. In this process there are two major steps: wine certification and quality assessment.
While certification ensures the prevention of illegal adulteration, quality evaluation, which is part of the certification process, serves as an indicator for improving wine making: by identifying the most important features, it also helps classify wines into premium brands.
Generally, wine certification is done using physicochemical and sensory tests. The former characterizes wine through measurements such as density, alcohol, or pH, while the latter relies on human senses. Since taste is the least understood of the human senses [6], the relationship between these two kinds of tests is very difficult to establish, and wine classification becomes an onerous task.
In such an atmosphere, technology makes it possible to collect and store data pertaining to wine quality. These data contain important information that explains the trends and features on which the quality of wine depends. Based on this data and its associated information, it is possible to improve quality by performing statistical analysis.
Therefore, in this work we have taken the Wine Quality data set [4] and performed two types of statistical analysis on it: multiple regression and logistic regression. With these analyses we have extracted the important features that affect wine quality and validated them with measures of "Goodness of fit". Finally, as an out-of-the-box initiative, we performed prediction on this data set in addition to classification by developing a model, to ensure that the model can be used on other data sets too.
In the previous sections a brief overview of the Wine Quality data set was given; in this section the raw data is analyzed. Since there are 11 independent variables with Quality as the response variable, two basic approaches come to mind:
Multiple Regression - This is one of the basic regression models that can be applied to find the nature of the relationship between the dependent variable and the independent variables. It also helps us examine the relationships among the different variables in the data set.
Logistic Regression - Because of shortcomings of linear regression, such as its inability to deal with a categorical response, it is better and more appropriate to use logistic regression here. Combining this logistic regression with prediction-based analysis of the model will provide a tangible conclusion.
But before performing any analysis, we need to examine the effect of the features on quality through box plots in the next section. With this in mind, we will first implement multiple regression, followed by logistic regression.
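Since the response is categorical, logistic regression models the probability of the binary outcome through the logistic (sigmoid) link rather than fitting the response directly. The report's own computations are in R; the following is a minimal, purely illustrative Python sketch of that link function:

```python
import math

def sigmoid(eta):
    """Logistic (inverse-logit) link: maps a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# A linear predictor of 0 corresponds to a predicted probability of 0.5;
# large positive values push the probability toward 1, negative toward 0.
print(sigmoid(0.0))   # 0.5
print(sigmoid(2.0))   # ~0.88
```

This is why logistic regression copes with a 0/1 response where linear regression cannot: its fitted values are always valid probabilities.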
In this figure the relationships among the variables are shown. It displays the effect of one feature in the presence of another: whether the two are positively correlated, negatively correlated, or not related at all.
Although this model violates the conditions of model adequacy, simple linear regression was fitted for the sake of obtaining further diagnostics of this data set, and the following plots were produced.
These diagnostic plots do not make sense, and it is impossible to draw any conclusions from them. The main reason for such improper plots lies in the choice of response variable: in our view, since the response variable is categorical in nature, linear or multiple regression analysis will not work here.
Before applying regression analysis, it is pertinent to check whether the following conditions are satisfied. Violation of these conditions would result in unstable models [7].
The Shapiro-Wilk test is conducted mainly to examine the normality of a given data set. Its null hypothesis is that the population is normally distributed.
In this case, our p-value is less than 2.2e-16, which is below \(\alpha\) = 0.05, the level of significance, so we reject the null hypothesis. This shows that the data are not normally distributed; the normality assumption is therefore violated.
Shapiro-Wilk normality test
data: Z$Q
W = 0.85759, p-value < 2.2e-16
Collinearity, defined as a dependence relation among the independent variables, is an important factor that needs to be considered. It can in fact sabotage the model by producing unstable and unreliable estimates.
There are two different methods to examine the collinearity in this case:
From the pair-wise relationship plot
Variance Inflation Factor (VIF)
From sheer observation of the pair-wise plot we can see a positive correlation between pH and volatile acidity (VA) and a negative correlation between fixed acidity and pH. Similarly, citric acid, acidity, and pH are all correlated, as together they determine the acidity.
VIF gives exact values that help us draw conclusions about collinearity. Any variable with a VIF greater than 2 can be considered to suffer from collinearity. Since this model suffers from collinearity and also fails to satisfy the normality assumption, multiple regression cannot be applied to this data set.
FA VA CA RS CL FS TS D
2.887544 1.318209 1.710269 1.335600 1.198213 1.439199 1.494775 2.569294
PH S A Q
1.831266 1.186651 1.665065 1.098369
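The VIF values above come from the R fit. As an illustration of how a VIF arises, the Python sketch below (hypothetical toy data, not the wine measurements) computes VIF = 1/(1 - R²) for a pair of predictors, where the auxiliary R² reduces to the squared Pearson correlation:

```python
import math

def vif_two_predictors(x1, x2):
    """For exactly two predictors, the R^2 of regressing one on the other is
    the squared Pearson correlation, so VIF = 1 / (1 - r^2) for both."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    v1 = sum((a - m1) ** 2 for a in x1)
    v2 = sum((b - m2) ** 2 for b in x2)
    r = cov / math.sqrt(v1 * v2)
    return 1.0 / (1.0 - r * r)

# A strongly collinear pair yields a very large VIF; the rule used in this
# report flags anything above 2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 2.0, 2.9, 4.2, 5.0]   # nearly a copy of x1
print(vif_two_predictors(x1, x2))
```

With more than two predictors the same formula applies, but the auxiliary R² comes from a full multiple regression of each predictor on all the others.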
This table is similar to the regression summary obtained using linear regression. It comprises all the variables, both significant and non-significant. By removing the non-significant variables, we can determine the features affecting wine quality.
Call:
glm(formula = Y ~ ., family = binomial(link = "logit"), data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.4025 -0.8387 0.3105 0.8300 2.3142
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 42.949948 79.473979 0.540 0.58890
FA 0.135980 0.098483 1.381 0.16736
VA -3.281694 0.488214 -6.722 1.79e-11 ***
CA -1.274347 0.562730 -2.265 0.02354 *
RS 0.055326 0.053770 1.029 0.30351
CL -3.915713 1.569298 -2.495 0.01259 *
FS 0.022220 0.008236 2.698 0.00698 **
TS -0.016394 0.002882 -5.688 1.29e-08 ***
D -50.932385 81.148745 -0.628 0.53024
PH -0.380608 0.720203 -0.528 0.59717
S 2.795107 0.452184 6.181 6.36e-10 ***
A 0.866822 0.104190 8.320 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2209.0 on 1598 degrees of freedom
Residual deviance: 1655.6 on 1587 degrees of freedom
AIC: 1679.6
Number of Fisher Scoring iterations: 4
The summary of the statistical model of wine quality on the other variables is shown in the adjacent block. This summary comprises all the variables, their p-values, and their significance. By performing hypothesis tests on each coefficient (the Wald z-tests reported in the summary) we can find the variables that do and do not contribute significantly. Based on this decision, the variables that do not contribute significantly are removed and the corresponding goodness of fit is computed. One such test is shown below.
The same method shown below was used to check all the other variables in this model. \(\beta_{1}\) : Change in the log-odds of quality for a unit change in fixed acidity.
\(H_{0}\) : \(\beta_{1}\) = 0 versus \(H_{1}\): \(\beta_{1}\) \(\neq\) 0.
The p-value is 0.16736 > 0.05 = \(\alpha\), the level of significance, so we fail to reject \(H_{0}\). Therefore fixed acidity does not contribute significantly to the model, given that the other variables are present in the model.
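The p-values in the summary block can be reproduced from the reported estimate and standard error. A small Python sketch of this computation, using the FA row of the summary and the normal approximation that the glm summary uses:

```python
import math

def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: beta = 0, using the normal approximation
    z = estimate / SE (the z value column of the glm summary)."""
    z = estimate / std_error
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))  # standard normal CDF
    return 2.0 * (1.0 - phi)

# Fixed acidity (FA) row of the summary: estimate 0.135980, SE 0.098483
p = wald_p_value(0.135980, 0.098483)
print(round(p, 5))  # ~0.16736, matching Pr(>|z|) in the output
```

The same function applied to the VA row (estimate -3.281694, SE 0.488214) gives the near-zero p-value printed as 1.79e-11.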
In this way, the non-significant variables (FA, RS, D, pH) are identified and removed; the remaining variables contribute the most to the model according to the summary table. A new model based on these significant variables is then developed, and its goodness of fit, sensitivity, specificity, concordance, ROC, and predictability are determined.
There is a famous quote by George Box: \(\color{red}{\text{All models are wrong but some are useful}}\). Since no model is perfect, we can only look for models that fit a given set of observations well. To determine and conclude how well a particular model fits a set of observations, \(\color{red}{\text{Goodness of fit}}\) measures are used.
There are several measures of fit available that can be used for computation purposes. These measures are sometimes classified into \(\color{red}{\text{Global}}\) and \(\color{red}{\text{Logical}}\). They are:
• Chi-square goodness of fit tests and deviance
• Hosmer-Lemeshow tests
• Classification tables
• ROC curves
• Logistic regression \(R^{2}\)
• Model validation via an outside data set or by splitting a data set
Out of the several goodness-of-fit measures mentioned, we chose the Chi-square test, ROC curves, McFadden \(R^{2}\), and prediction by data splitting. Besides these, the sensitivity, specificity, concordance, and accuracy will also be measured.
The McFadden pseudo \(R^{2}\) can be written as \(1- \frac{\ln{L_{1}}}{\ln{L_{0}}}\), where \(\ln{L_{1}}\) is the log-likelihood of the fitted model and \(\ln{L_{0}}\) is that of the null model. The value of McFadden's \(R^{2}\) lies between 0 and 1; in practice, values of about 0.2-0.4 are already considered excellent, as it is very hard to obtain higher values.
McFadden
0.2494976
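Assuming the usual definition, this value can be reproduced directly from the fitted and null log-likelihoods; the sketch below uses the log-likelihoods reported for the reduced model later in this report. (Since deviance = -2 ln L, the same number can be obtained from the pair of deviances.)

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden pseudo-R^2: 1 - lnL(fitted) / lnL(null)."""
    return 1.0 - ll_model / ll_null

# Log-likelihoods (llh, llhNull) reported for the reduced model in this report.
print(mcfadden_r2(-621.3861644, -829.6153188))  # ~0.251
```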
This test is used to find out how much our model has improved: adding predictors reduces the deviance. Here the fitted model brought the null deviance of 2209.0 down to a residual deviance of about 1657.8, a reduction of over 550. The goal of logistic regression estimation is to minimize the residual deviance.
Analysis of Deviance Table
Model: binomial, link: logit
Response: Y
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 1598 2209.0
FA 1 14.613 1597 2194.4 0.0001320 ***
VA 1 161.169 1596 2033.2 < 2.2e-16 ***
CA 1 4.325 1595 2028.9 0.0375457 *
CL 1 11.952 1594 2016.9 0.0005458 ***
FS 1 6.470 1593 2010.4 0.0109730 *
TS 1 87.094 1592 1923.3 < 2.2e-16 ***
S 1 79.579 1591 1843.8 < 2.2e-16 ***
A 1 185.930 1590 1657.8 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
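Each Pr(>Chi) entry in the table is a chi-square tail probability for the deviance drop on the given degrees of freedom. A Python sketch of the 1-df case (all rows of this table have 1 df), checked against the CA row:

```python
import math

def chi2_sf_df1(x):
    """Survival function of the chi-square distribution with 1 degree of
    freedom, via the identity P(chi^2_1 > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# CA row of the deviance table: deviance drop 4.325 on 1 df
print(chi2_sf_df1(4.325))  # ~0.0375, matching Pr(>Chi) = 0.0375457
```

The identity follows because a chi-square variable with 1 df is the square of a standard normal, so its tail probability is the two-sided normal tail at \(\sqrt{x}\).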
Sensitivity is the proportion of actual 1's correctly predicted by the model, while specificity is the proportion of actual 0's correctly predicted. This particular data set has a sensitivity of about 71% and a specificity of about 79%.
Concordance is the percentage of all (1,0) pairs for which the predicted probability of the actual 1 is greater than that of the actual 0. In general, the higher the concordance, the better the model. This model has a concordance of about 82%.
[1] 0.7069892
[1] 0.7871345
$Concordance
[1] 0.8217789
$Discordance
[1] 0.1782211
$Tied
[1] 5.551115e-17
$Pairs
[1] 636120
0 1
0 526 182
1 218 673
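Assuming the table's rows are predictions and its columns actuals (the orientation is not labelled in the output), the two proportions printed above can be recovered from the four cell counts:

```python
# Confusion matrix from the output, assuming rows = predicted, columns = actual:
#               actual 0   actual 1
# predicted 0      526        182
# predicted 1      218        673
tn, fn = 526, 182
fp, tp = 218, 673

sensitivity = tp / (tp + fn)                    # fraction of actual 1's caught
specificity = tn / (tn + fp)                    # fraction of actual 0's caught
accuracy = (tp + tn) / (tp + tn + fp + fn)      # overall fraction correct

print(round(sensitivity, 7))  # 0.7871345
print(round(specificity, 7))  # 0.7069892
print(round(accuracy, 4))
```

How these two map onto the labels "sensitivity" and "specificity" depends on which axis holds the actuals; under the opposite orientation the two values simply swap names.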
One of the major measures of goodness of fit is prediction. Splitting the data set into training and test sets, developing a model on the training data, and validating it on the test data is one of the best accepted measures. As a final step in this work, we have done the same.
Out of the roughly 1500 observations, 80% of the data (about 1200 observations) is used for training, while the remaining 20% (about 300) is used as test data. In summary:
Splitting the data set into training and test sets
Developing a model with the training data set
Apply the logistic regression
Calculate the Goodness of fit
Sensitivity, Specificity, Accuracy, Concordance
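The first of the steps above can be sketched in a few lines. The Python below uses placeholder row indices and a fixed seed (both assumptions for illustration), mirroring the 80/20 proportions described above:

```python
import random

def train_test_split(rows, test_frac=0.2, seed=42):
    """Randomly partition rows into training and test sets (here 80/20,
    mirroring the split described in this report)."""
    rows = list(rows)
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    rng.shuffle(rows)
    cut = int(len(rows) * (1.0 - test_frac))
    return rows[:cut], rows[cut:]

data = list(range(1500))            # placeholder for the wine observations
train, test = train_test_split(data)
print(len(train), len(test))        # 1200 300
```

The model is then fitted on `train` only, and every goodness-of-fit measure in the following sections is evaluated on `test`.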
This is the key factor that decides the predictability of the model: the greater the area under the curve (AUC), the better the predictive ability. The main idea behind the ROC curve is that it traces the true positive rate as the prediction probability cut-off is reduced from 1 to 0. Here the AUC is about 80%, which is quite good.
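The AUC is numerically the same quantity as the concordance reported earlier: the probability that a randomly chosen actual positive is scored above a randomly chosen actual negative (ties counting half). A brute-force Python sketch on toy scores (illustrative values only):

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive receives the higher predicted probability; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy predicted probabilities for 4 actual positives and 4 actual negatives
pos = [0.9, 0.8, 0.6, 0.55]
neg = [0.7, 0.4, 0.3, 0.2]
print(auc_from_scores(pos, neg))  # 0.875
```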
As mentioned in the previous section, there are several measures of Goodness of fit.
llh llhNull G2 McFadden r2ML
-621.3861644 -829.6153188 416.4583087 0.2509948 0.2932290
r2CU
0.3914429
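Assuming the standard formulas, the three pseudo-\(R^{2}\) values above (McFadden, r2ML = Cox-Snell/maximum likelihood, r2CU = Nagelkerke/Cragg-Uhler) can be reproduced from the reported log-likelihoods with a training-set size of about 1200, an assumption based on the split described earlier:

```python
import math

def pseudo_r2(ll_model, ll_null, n):
    """Three common pseudo-R^2 measures for a logistic model, computed from
    the fitted and null log-likelihoods; n is the number of observations."""
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))    # r2ML
    nagelkerke = cox_snell / (1.0 - math.exp((2.0 / n) * ll_null))  # r2CU
    return mcfadden, cox_snell, nagelkerke

# llh and llhNull from the block above; note G2 = 2*(llh - llhNull) = 416.458
m, cs, nk = pseudo_r2(-621.3861644, -829.6153188, 1200)
print(m, cs, nk)
```

Nagelkerke's measure rescales Cox-Snell so that its maximum attainable value is 1, which is why r2CU is always the largest of the three.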
The model developed on the training data has a sensitivity of about 85%, a specificity of about 67%, and a concordance of about 82%.
[1] 0.8493151
[1] 0.6722222
$Concordance
[1] 0.8207002
$Discordance
[1] 0.1792998
$Tied
[1] -2.775558e-17
$Pairs
[1] 39420
Wine quality analysis has been carried out by various authors; there are three different works on the same data set. In [1], different combinations of bi-variate and multivariate analyses were performed, examining pairs of variables and their effect on the overall quality. In [2], different types of regression were performed on the same data set: linear, polynomial, multiple, and logistic, together with prediction analysis. There a maximum accuracy of about 70% was obtained, while in our work we have achieved about 80% accuracy. In [3], linear modelling was done on the entire data set and measures such as \(R^{2}\) were reported; these values are comparatively lower than those obtained and shown by [1].
[1] https://rpubs.com/prasad_pagade/wine_quality_prediction
[2] http://rstudio-pubs-static.s3.amazonaws.com/438329_edfaab4011ce44a59fb9ae2d216d8dea.html
[3] https://www.kaggle.com/sagarnildass/red-wine-analysis-by-r
[4] https://archive.ics.uci.edu/ml/index.php
[5] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, "Modeling wine preferences by data mining from physicochemical properties", Decision Support Systems, 47(4):547-553, 2009.
[6] D. Smith and R. Margolskee “Making sense of taste. Scientific American,Special issue”.
[7] Dr. Chen Notes Chapter 4 “Model Adequacy Checking”.